Remove duplicate search entries #981

BPerlakiH · 2024-09-15T11:57:35Z

Fixes: #980
Superseeds #979

What we really want is unique search results, which is in fact unique URLs that the user can click on and navigate further. If 2 search results are linking to the same URL, that's a duplicate.

Solution

Store a set of foundURLs as we are processing the search results (both indexed and title search), and look for duplicates. Also make sure we adjust the URL endings to always contain the trailing "/" slash, but this is only for our comparison, the returned search result URLs should not be modified.

After the fix on macOS:

codecov-commenter · 2024-09-15T12:01:44Z

Codecov Report

Attention: Patch coverage is 2.08333% with 47 lines in your changes missing coverage. Please review.

Project coverage is 39.19%. Comparing base (9093134) to head (2065328).

Files with missing lines	Patch %	Lines
SwiftUI/Model/SearchOperation/SearchOperation.mm	2.08%	47 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #981      +/-   ##
==========================================
+ Coverage   37.23%   39.19%   +1.96%     
==========================================
  Files         111      111              
  Lines        6346     6360      +14     
==========================================
+ Hits         2363     2493     +130     
+ Misses       3983     3867     -116

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

HiroyasuNishiyama · 2024-09-15T12:25:48Z

@BPerlakiH Thank you for your fix.

I think one issue is how to handle the score of search results when there are duplicate entries.
In the original code, title search results are ranked higher than index search results and are displayed higher in the result list.
Since the title search result score seems more natural, I tried to reflect that rather than simply dropping it in my fix.

BPerlakiH · 2024-09-15T18:21:43Z

Thank you @HiroyasuNishiyama, that is a good point.
I did addressed 2 concerns in my latest changes, one described in the issue itself, about failing archives, and loosing results on those.

I also changed the order of execution, so we do the title search first, and then the indexed search, this way if we drop duplicates we do that from the indexed ones, leaving the probability: null, which I think would be equivalent to your solution.

Please note that we do another sorting on these results here:

kiwix-apple/SwiftUI/Model/SearchOperation/SearchOperation.swift

Lines 48 to 57 in 9093134

    
           // calculate score for all results 
        
           for result in results { 
        
               guard !isCancelled else { return } 
        
               let distance = WagnerFischer.distance(result.title.lowercased()[...], searchText[...]) 
        
               if let probability = result.probability?.doubleValue { 
        
                   result.score = NSNumber(floatLiteral: Double(distance) * Foundation.log(7.5576 - 6.4524 * probability)) 
        
               } else { 
        
                   result.score = NSNumber(integerLiteral: distance) 
        
               } 
        
           }

HiroyasuNishiyama · 2024-09-16T00:28:46Z

@BPerlakiH
I think simply changing the order to do title search first introduces another issue.
The results of title search do not include snippets, so if there is a duplicate entry, the snippet will be dropped from the result.

And I realized that my fix had a misunderstanding.
To match the current search scores, I think the probability value should be set to the larger of 0.75 (i.e. the solution of log(7.5576 - 6.4524 * x)=1) or the title search result probability rather than setting it to nil.
Though, I know it's ugly to introduce such a constant into this code.

FYI, no exceptions occur during index searches in my test environment.

rgaudin · 2024-09-16T10:19:00Z

I think it would be good to have @kelson42's opinion here, in regard to how this should compare with other readers.

kelson42 · 2024-09-16T11:46:42Z

At this stage, I would recommend to stop investing more time on this PR. Like I suspected, the solution is not straight.

We should think out of the box about this. IMO whatever we decide do, this is not the role of the reader to do this kind of "smart" computations.

HiroyasuNishiyama · 2024-09-17T12:52:58Z

As a user, I like the current Apple Kiwix search interface, where I do not have to worry about the indexing method.
And I would prefer that redundant entries be excluded from the results from the search results.
Of course, it is up to your team to decide what design choices to make.

For reference, I would like to leave my current thoughts on this PR.
My first PR and this PR removes redundant search results in the Objective C++ module (SearchOperation.mm).
Since this causes issues in handling the deletion results as mentioned above, I think it would be better to delete identical entries so that the first entry remains after the sorting process in swift (SearchOperation.swift) shown by @BPerlakiH .
As an addition, we need to add a process to leave the snippet of the removed entry.

BPerlakiH · 2024-09-17T20:12:47Z

Ok, to sum up what's in this PR already:

removing duplicate entries from search results (I reversed the title/index now, so snippets are there)
try / catching errors on searches by each ZIM file (and search type), this way we avoid no results (or less results) if one of the ZIM files is invalid.

For future discussion:

we could improve to use only one way of searching, if there are absolutely no results from that, maybe only in that case fall back to another type of search
possible improvement on the OpenZIM side, is to have snippets for title searches as well, eg. returning the first sentence / paragraph of the article (currently we get back the title as the snippet, which we can not / do not use).

IMHO: this PR is already an improvement from what we had. I also do agree with @kelson42 that this is still not an ideal place to be, and a broader / more universal approach is needed for search handling, that would unify (cross platform) and simplify the reader side.

kelson42 · 2024-09-18T09:15:55Z

@BPerlakiH @rgaudin @HiroyasuNishiyama I will merge this PR because it seems to fix the bug reported originaly. For the long standing problem/improvement, I have created a dedicated issue.

BPerlakiH added bug macOS iOS labels Sep 15, 2024

BPerlakiH added this to the 3.6.0 milestone Sep 15, 2024

BPerlakiH requested review from rgaudin and kelson42 September 15, 2024 11:57

BPerlakiH linked an issue Sep 15, 2024 that may be closed by this pull request

Duplicate entries in search results #980

Closed

BPerlakiH temporarily deployed to internal September 15, 2024 11:57 — with GitHub Actions Inactive

BPerlakiH mentioned this pull request Sep 15, 2024

Remove duplicate entry from search results #979

Closed

BPerlakiH temporarily deployed to internal September 15, 2024 18:18 — with GitHub Actions Inactive

BPerlakiH temporarily deployed to internal September 17, 2024 07:24 — with GitHub Actions Inactive

BPerlakiH added 3 commits September 18, 2024 09:17

Remove duplicate search entries

6042841

Archive fault tolerant search, preferring title results

c5cd822

Revert title index search order

990e51f

kelson42 force-pushed the 980-remove-duplicate-entry-from-search-results branch from 2065328 to 990e51f Compare September 18, 2024 09:17

kelson42 temporarily deployed to internal September 18, 2024 09:17 — with GitHub Actions Inactive

kelson42 removed this from the 3.6.0 milestone Sep 18, 2024

kelson42 merged commit c9043bf into main Sep 18, 2024
4 checks passed

kelson42 deleted the 980-remove-duplicate-entry-from-search-results branch September 18, 2024 09:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove duplicate search entries #981

Remove duplicate search entries #981

BPerlakiH commented Sep 15, 2024 •

edited by kelson42

Loading

codecov-commenter commented Sep 15, 2024 •

edited

Loading

HiroyasuNishiyama commented Sep 15, 2024

BPerlakiH commented Sep 15, 2024 •

edited

Loading

HiroyasuNishiyama commented Sep 16, 2024 •

edited

Loading

rgaudin commented Sep 16, 2024

kelson42 commented Sep 16, 2024

HiroyasuNishiyama commented Sep 17, 2024

BPerlakiH commented Sep 17, 2024

kelson42 commented Sep 18, 2024

Remove duplicate search entries #981

Remove duplicate search entries #981

Conversation

BPerlakiH commented Sep 15, 2024 • edited by kelson42 Loading

Solution

After the fix on macOS:

codecov-commenter commented Sep 15, 2024 • edited Loading

Codecov Report

HiroyasuNishiyama commented Sep 15, 2024

BPerlakiH commented Sep 15, 2024 • edited Loading

HiroyasuNishiyama commented Sep 16, 2024 • edited Loading

rgaudin commented Sep 16, 2024

kelson42 commented Sep 16, 2024

HiroyasuNishiyama commented Sep 17, 2024

BPerlakiH commented Sep 17, 2024

kelson42 commented Sep 18, 2024

BPerlakiH commented Sep 15, 2024 •

edited by kelson42

Loading

codecov-commenter commented Sep 15, 2024 •

edited

Loading

BPerlakiH commented Sep 15, 2024 •

edited

Loading

HiroyasuNishiyama commented Sep 16, 2024 •

edited

Loading